12/03/2020

e-CIS

Please fill out the anonymous electronic course evaluation. Feel free to leave your feedback, the course can always improve thanks to students’ input!

Preliminaries (frequentist statistics)

Consider a simple linear model \[Y = \beta_{0} + X_{1} \beta_{1} + \dots + X_{p} \beta_{p} + \varepsilon\]

The statement above defines a sampling model for the observed data, given a set of unknown parameters \(\boldsymbol{\theta} = (\beta_{0}, \beta_{1}, \dots, \beta_{p}, \sigma^{2})\), i.e. 

\[Y_{i} \mid \boldsymbol{\theta} \sim N(\beta_{0} + x_{i1} \beta_{1} + \dots + x_{ip} \beta_{p}, \sigma^{2})\]

Our inferential goal is often that of obtaining point estimators \(\hat{\boldsymbol{\theta}}\), interval estimators \((\hat{\boldsymbol{\theta}}_{L}, \hat{\boldsymbol{\theta}}_{U})\) and test hypotheses about \(\boldsymbol{\theta}\).
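
As a quick illustration, here is a minimal Python sketch that simulates data from this sampling model and recovers a point estimate by ordinary least squares (the values of \(n\), \(p\), \(\boldsymbol{\beta}\) and \(\sigma\) are arbitrary):

    import numpy as np

    rng = np.random.default_rng(0)
    n, p = 100, 2                       # sample size and number of predictors (illustrative)
    beta = np.array([1.0, 0.5, -0.3])   # (beta_0, beta_1, beta_2), chosen arbitrarily
    sigma = 1.0

    X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with intercept
    y = X @ beta + rng.normal(scale=sigma, size=n)              # Y = X beta + eps

    beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)            # ordinary least squares estimate
    print(beta_hat)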

Preliminaries (frequentist statistics)

Classical (frequentist) statistical inference is based on the assumption that \(\boldsymbol{\theta}\) is fixed, and \(Y\) is random (even after being observed).

A point estimator \(\hat{\boldsymbol{\theta}} = \hat{\boldsymbol{\theta}}(Y)\) is a summary of the data.

Under the assumption of repeated sampling (if I could see a countable sequence of new datasets) \(\hat{\boldsymbol{\theta}}\) is random and has a sampling distribution.

For example, if the parameter of interest is \(\theta = \text{average height}\), we use the sample mean \(\hat{\theta} = \bar{Y}\), whose sampling distribution is \(\hat{\theta} \sim N(\theta, \sigma^{2} / n)\).
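
A minimal Python sketch of the repeated-sampling idea (the values of \(\theta\), \(\sigma\) and \(n\) are made up): drawing many hypothetical datasets shows that \(\hat{\theta}\) fluctuates around \(\theta\) with standard deviation roughly \(\sigma / \sqrt{n}\).

    import numpy as np

    rng = np.random.default_rng(1)
    theta, sigma, n = 170.0, 10.0, 50   # "true" mean height, sd, and sample size (illustrative)

    # Repeated sampling: draw many hypothetical datasets, one estimate per dataset
    theta_hats = np.array([rng.normal(theta, sigma, size=n).mean() for _ in range(10_000)])

    print(theta_hats.mean())   # close to theta
    print(theta_hats.std())    # close to sigma / sqrt(n)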

Example (benefit of gene therapy)

  • \(n\) patients are assigned an experimental drug (e.g. Covid vaccine)
  • We aim to estimate the percentage of patients benefiting from treatment, say \(\theta\)

Possible scientific question:

  • Given data \(Y\), what is the probability that the percentage of patients benefiting from the new drug is at least \(50\%\)?
  • Namely: \(P(\theta > 0.50 \mid Y) = ?\)


Note: this seemingly innocuous question cannot be answered with the tools of classical statistics.

Example (benefit of gene therapy)

The classical shortcut to this question is to set up a testing scenario \(H_{0}: \theta \leq 0.5\), \(H_{1}: \theta > 0.5\).

Consider a Z-test:

\[Z = \frac{\hat{\theta} - 0.5}{\sqrt{0.5 (1 - 0.5) / n}}\]

With p-value = \(P(Z > z) = 1 − \Phi(z)\), the answer is now binary: accept or reject \(H_{0}\).
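
A minimal sketch of this test in Python (the counts are invented for illustration):

    import numpy as np
    from scipy.stats import norm

    n, successes = 50, 28          # hypothetical trial: 28 of 50 patients benefit
    theta_hat = successes / n

    z = (theta_hat - 0.5) / np.sqrt(0.5 * (1 - 0.5) / n)   # Z statistic under H0: theta = 0.5
    p_value = 1 - norm.cdf(z)                              # one-sided p-value P(Z > z)
    print(z, p_value)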

Problems:

  • A decision not to reject the null hypothesis provides little quantitative information regarding the truth of the null hypothesis.
  • The rejection of a null hypothesis may occur even when evidence from the data strongly supports its validity.

The Bayesian paradigm

  • Parameters are not fixed unknown quantities, but random quantities. Data are fixed after we observe them (no need to think about repeated sampling scenarios)
  • The Bayesian inferential paradigm requires the introduction of a prior distribution \(p(\theta)\), encoding the information we have about \(\theta\) before seeing any data
  • Bayesian statistical inference uses Bayes’ Theorem to combine prior information and sample data to make conclusions about a parameter of interest \[p(\theta \mid \mathbf{y}) = \frac{p(\mathbf{y} \mid \theta) p(\theta)}{p(\mathbf{y})} \propto p(\mathbf{y} \mid \theta) p(\theta)\]
  • We are done! Statistical inference then reduces to simple descriptive statistics of \(p(\theta \mid \mathbf{y})\): no hacks, exceptions, or ad hoc adjustments (a minimal sketch of this mechanism follows this list)
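
A minimal sketch of the mechanics of Bayes’ Theorem for a single proportion \(\theta\), using a crude grid approximation (the data and grid size are made up for illustration):

    import numpy as np
    from scipy.stats import binom, beta

    theta_grid = np.linspace(0.001, 0.999, 999)     # grid of candidate values for theta
    prior = beta.pdf(theta_grid, 1, 1)              # flat Beta(1, 1) prior
    likelihood = binom.pmf(7, 20, theta_grid)       # e.g. 7 successes in 20 trials (made up)
    posterior = likelihood * prior                  # p(theta | y) proportional to p(y | theta) p(theta)
    posterior /= np.trapz(posterior, theta_grid)    # normalize so the density integrates to 1

    print(theta_grid[posterior.argmax()])           # posterior mode, close to 7/20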

The Bayesian paradigm

  • Bayesian inference differs from classical inference in that it specifies a probability distribution for the parameter(s) of interest

Why use Bayesian methods?

  • We wish to specifically incorporate previous knowledge we have about a parameter of interest
  • To logically update our knowledge about the parameter after observing sample data
  • To make formal probability statements about the parameter of interest
  • To specify model assumptions and check model quality and sensitivity to these assumptions in a straightforward way

The Bayesian paradigm

Why do people use classical methods?

  • If the parameter(s) of interest is/are truly fixed (without the possibility of changing), as is possible in a highly controlled experiment
  • If there is no prior information available about the parameter(s)

Many of the reasons classical methods are more common than Bayesian methods are historical:

  • Many methods were developed in the context of controlled experiments
  • Bayesian methods require a bit more mathematical formalism
  • Historically (though no longer), realistic Bayesian analyses were infeasible due to a lack of computing power

Different interpretations of probability

  • Frequentist definition of the probability of an event: If we repeat an experiment a very large number of times, what is the proportion of times the event occurs?
    • Problem: For some situations, it is impossible to repeat (or even conceive of repeating) the experiment many times
    • Example: The probability that Governor Abbott is re-elected in 2022


  • Subjective probability: Based on an individual’s degree of belief that an event will occur
    • Example: A bettor is willing to risk up to \(\$200\) betting that Abbott will be re-elected, in order to win \(\$100\). The bettor’s subjective \(P[\text{Abbott wins}] = 200 / (200 + 100) = 2/3\)
    • The Bayesian approach can naturally incorporate subjective probabilities about the parameter, where appropriate

Inference

Frequentist setting:

  • Specify a model for the data
  • Write the likelihood of the model
  • Maximize the likelihood, either by setting the derivatives to 0 or by numerical optimization

Bayesian setting:

  • Specify a model for the data
  • Write the likelihood of the model
  • Choose prior distributions
  • Calculate the posterior distribution:
    • if the prior is conjugate (more to come), can do math
    • otherwise, use Markov chain Monte Carlo (MCMC) algorithms to sample from the posterior distribution of the parameters (a minimal sketch follows this list)
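
As a rough illustration of the last bullet, a minimal random-walk Metropolis sketch for a Binomial likelihood with a Beta prior (the data, proposal scale and number of iterations are arbitrary; in practice one would rely on an established MCMC package):

    import numpy as np
    from scipy.stats import binom, beta

    rng = np.random.default_rng(2)
    y, n = 7, 20                                    # illustrative data

    def log_post(theta):
        # log posterior up to an additive constant: log likelihood + log prior
        if not 0 < theta < 1:
            return -np.inf
        return binom.logpmf(y, n, theta) + beta.logpdf(theta, 1, 1)

    theta, draws = 0.5, []
    for _ in range(5_000):
        prop = theta + rng.normal(scale=0.1)        # random-walk proposal
        if np.log(rng.uniform()) < log_post(prop) - log_post(theta):
            theta = prop                            # accept the proposal
        draws.append(theta)

    print(np.mean(draws[1_000:]))                   # posterior mean after discarding burn-in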

Example (conjugate priors)

Suppose we observe \(y\), the number of infected individuals in a sample of size \(n\), and we want to estimate the prevalence \(\theta\) of a disease. A reasonable sampling model is

\[y \sim \text{Binomial}(n, \theta)\]

\[p(y \mid \theta) = {n\choose y} \theta^{y} (1 - \theta)^{n - y}\]

A reasonable prior for \(\theta\) could belong to the Beta family, so that

\[\theta \sim \text{Beta}(a, b)\]

\[p(\theta) = C \, \theta^{a-1} (1 - \theta)^{b-1}\]

It turns out that this prior is conjugate to the Binomial likelihood.

Beta distribution

A flexible family of distributions representing prior understanding about a proportion.
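
A minimal Python sketch plotting a few Beta densities to show how flexible the family is (the \((a, b)\) pairs are chosen arbitrarily):

    import numpy as np
    from scipy.stats import beta
    import matplotlib.pyplot as plt

    theta = np.linspace(0.001, 0.999, 500)
    for a, b in [(1, 1), (0.5, 0.5), (2, 5), (10, 10)]:   # a few illustrative shape pairs
        plt.plot(theta, beta.pdf(theta, a, b), label=f"Beta({a}, {b})")
    plt.xlabel("theta")
    plt.ylabel("density")
    plt.legend()
    plt.show()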

Posterior distribution

The posterior distribution is simply

\[p(\theta \mid y) \propto p(y \mid \theta) p(\theta) = \theta^{a-1} (1 - \theta)^{b-1} \theta^{y} (1 - \theta)^{n-y} = \theta^{a+y-1} (1 - \theta)^{b+n-y-1}\]

which we recognize as the kernel of a Beta random variable:

\[\theta \mid y \sim \text{Beta}(a + y, b + n - y)\]

Suppose we see 1 infection among 20 patients and assume relative prior ignorance, so that \(\theta \sim \text{Beta}(1, 1)\); then we get \(\theta \mid y \sim \text{Beta}(2, 20)\)
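
A minimal sketch of this calculation in Python; note that the posterior also answers the earlier question about \(P(\theta > 0.5 \mid Y)\) directly:

    from scipy.stats import beta

    a, b, y, n = 1, 1, 1, 20                 # Beta(1, 1) prior; 1 infection out of 20 patients
    posterior = beta(a + y, b + n - y)       # Beta(2, 20) posterior

    print(posterior.mean())                  # posterior mean, 2 / 22, about 0.09
    print(1 - posterior.cdf(0.5))            # P(theta > 0.5 | y), essentially zero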

Confidence / Credible Intervals

Frequentist Intervals (Confidence Intervals)

  • Inference is based on two statistics \(A(y)\) and \(B(y)\) s.t. \(P(A < \theta < B) = \gamma\)
  • This is a probability statement about the joint distribution of two random variables \(A\) and \(B\)
  • After the data are collected, \(A = a\) and \(B = b\), and it is no longer possible to assign a probability to the event that \(\theta\) lies in \((a, b)\)


Bayesian Intervals (Credible Intervals)

  • Inference is based on \(p(\theta \mid Y)\) and credible intervals may be defined as \((\alpha, \beta)\) s.t. \(p(\alpha < \theta < \beta \mid y) = \gamma\)
  • The undergraduate student’s dream becomes reality: the probability that \(\theta\) lies in \((\alpha, \beta)\) is indeed \(\gamma\) (a small numerical comparison follows this list)
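
For the infection example above, a minimal sketch contrasting the two kinds of intervals (the frequentist interval shown is a simple Wald interval, used only for illustration):

    import numpy as np
    from scipy.stats import beta, norm

    y, n = 1, 20
    theta_hat = y / n

    cred = beta.ppf([0.025, 0.975], 2, 20)             # 95% equal-tailed credible interval
    se = np.sqrt(theta_hat * (1 - theta_hat) / n)
    conf = theta_hat + norm.ppf([0.025, 0.975]) * se   # 95% Wald confidence interval

    print(cred)   # a probability statement about theta, given the observed data
    print(conf)   # a coverage statement over repeated samples (and it dips below 0 here)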

Connection to shrinkage methods

When the number of predictors \(p\) is large, it is often necessary to regularize the likelihood by defining penalties on the regression coefficients \(\boldsymbol{\beta}\). Penalized log likelihood functions take the form

\[\ell_{\lambda} (y; \boldsymbol{\beta}, \sigma^{2}) = \sum_{i=1}^{n} \log p(y_{i} \mid \boldsymbol{\beta}, \sigma^{2}) - g_{\lambda}(\boldsymbol{\beta})\]

where \(g_{\lambda}(\boldsymbol{\beta})\) is a penalty function and \(\lambda\) is a tuning parameter.

Examples:

  • L2 penalty: \(g_{\lambda}(\boldsymbol{\beta}) = \lambda \sum_{j=1}^{p} \beta_{j}^{2}\) leading to Ridge regression
  • L1 penalty: \(g_{\lambda}(\boldsymbol{\beta}) = \lambda \sum_{j=1}^{p} |\beta_{j}|\) leading to Lasso

Frequentist inference usually treats \(\lambda\) as a nuisance parameter and estimates \(\boldsymbol{\beta}\) given a value of \(\lambda\). The “optimal” \(\lambda\) is then selected via cross-validation.
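
A minimal sketch of this workflow using scikit-learn’s LassoCV (the simulated data are arbitrary, and scikit-learn calls the penalty parameter alpha):

    import numpy as np
    from sklearn.linear_model import LassoCV

    rng = np.random.default_rng(3)
    n, p = 100, 50
    X = rng.normal(size=(n, p))
    coef = np.zeros(p)
    coef[:3] = [2.0, -1.5, 1.0]                   # only a few truly nonzero coefficients
    y = X @ coef + rng.normal(size=n)

    # Cross-validation picks the "optimal" lambda (alpha), then beta is estimated at that value
    model = LassoCV(cv=5).fit(X, y)
    print(model.alpha_)                           # selected penalty
    print(np.sum(model.coef_ != 0))               # number of nonzero estimated coefficients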

Connection to shrinkage methods

We notice that

\[L_{\lambda} (y; \boldsymbol{\beta}, \sigma^{2}) = \prod_{i=1}^{n} p(y_{i} \mid \boldsymbol{\beta}, \sigma^{2}) \times e^{- g_{\lambda}(\boldsymbol{\beta})}\]

which is readily interpreted as

\[p(Y \mid \boldsymbol{\beta}, \sigma^{2}) p(\boldsymbol{\beta} \mid \lambda)\]

where \(p(\boldsymbol{\beta} \mid \lambda) \propto e^{- g_{\lambda}(\boldsymbol{\beta})}\) is the prior.
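
For instance, the L2 penalty corresponds (up to normalization) to independent Gaussian priors on the coefficients, while the L1 penalty corresponds to independent Laplace (double-exponential) priors:

\[g_{\lambda}(\boldsymbol{\beta}) = \lambda \sum_{j=1}^{p} \beta_{j}^{2} \;\Longleftrightarrow\; \beta_{j} \overset{iid}{\sim} N\!\left(0, \tfrac{1}{2\lambda}\right), \qquad g_{\lambda}(\boldsymbol{\beta}) = \lambda \sum_{j=1}^{p} |\beta_{j}| \;\Longleftrightarrow\; \beta_{j} \overset{iid}{\sim} \text{Laplace}\!\left(0, \tfrac{1}{\lambda}\right)\]

Under these priors, the posterior mode (MAP estimate) of \(\boldsymbol{\beta}\) coincides with the Ridge and Lasso solutions, respectively.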

Bayesian methods naturally shrink the parameter estimates thanks to the prior! This tends to make them more robust and less prone to overfitting.

Question time